Can we use attributes and metrics of songs to predict their likelihood of winning a Grammy?

What common traits, if any, do award-winning songs share? Can we combine intrinsic traits of songs with metrics defined by Spotify to identify award-winning musical features? In this paper, we walk through the collection, processing, and analysis of a dataset of roughly 1,000 popular songs, both award-winning and not.

The Data

Data Collection and Selection

JOAN TO DO: Write how we chose the 1,000 songs originally; how we ended up with 867; and which Grammy winners we stuck with (only Grammy winners, or also Grammy nominees?)

# Set the working directory to this file's folder
library("rstudioapi")
setwd(dirname(getActiveDocumentContext()$path))
load("final_df_n_str.RData")

Sys.setenv(LANG = "en") 

# Load necessary libraries
library(pROC)
library(MASS)
library(ROSE)
library(confintr)
library(ggplot2)
library(correlation)
library(corrplot)
library(class)
library(caret)
library(glmnet)
# Selecting the relevant variables
data = final_df_n_str
data = data[,c("track_name", "artist_name", "IsWinner", "Year","year",
               "followers", "acousticness", "danceability", "duration_ms",
               "energy", "instrumentalness", "key", "liveness", "loudness",
               "mode", "tempo", "time_signature", "valence")]

# Merge the two year variables, then drop the lowercase duplicate
data$Year[data$Year == "Undefined"] <- data$year[data$Year == "Undefined"]
data = data[,c("track_name","artist_name", "IsWinner", "Year", "followers",
               "acousticness", "danceability", "duration_ms",
               "energy", "instrumentalness", "key", "liveness", "loudness",
               "mode", "tempo", "time_signature", "valence")]

# Eliminating duplicates: locate the repeated tracks, then inspect them
which(data$track_name == "Closing Time")
which(data$track_name == "Smells Like Teen Spirit")
which(data$track_name == "Don't Wanna Fight")
data[c(669, 789, 914), ]

data = data[-c(669, 789, 914),]

# Removing songs released before 1992
sum(data$Year < 1992)
nrow(data)
data = data[data$Year >= 1992,]

# Creating song identifiers (track - artist)

names = paste0(data$track_name, " - ", data$artist_name)

# Eliminating unusable variables
data = data[,c("IsWinner", "Year", "followers", "acousticness",
               "danceability", "duration_ms", "energy",
               "instrumentalness", "key", "liveness", "loudness", "mode",
               "tempo", "time_signature", "valence")]
data = cbind(names = names, data)

# Casting variables (nominees are coded as winners)
data$IsWinner[data$IsWinner == "Winner"] = 1
data$IsWinner[data$IsWinner == "Nominee"] = 1
data$IsWinner[data$IsWinner == "Nothing"] = 0
data$IsWinner = as.integer(data$IsWinner)
data$Year = as.integer(data$Year)
data$mode = as.factor(data$mode)
data$key = as.factor(data$key)
data$time_signature = as.factor(data$time_signature)
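Casting key, mode, and time_signature to factors matters downstream: modeling functions dummy-code factor levels instead of treating them as ordered numbers. A minimal sketch on toy data (not the project dataset) shows the design matrix a regression would actually use:

```r
# Toy data: a two-level factor plus a numeric predictor
toy <- data.frame(mode = factor(c(0, 1, 1, 0)), energy = c(0.7, 0.9, 0.5, 0.6))

# model.matrix() reveals the dummy coding: an intercept column, one
# indicator column (mode1) for the non-reference level, and energy as-is
model.matrix(~ mode + energy, data = toy)
```

Had mode been left numeric, the model would have fit a single slope to the 0/1 values instead of estimating a separate effect per level; for a variable like key with twelve levels, the difference is substantial.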

# Inspecting the processed data
summary(data)
summary(data$IsWinner)

Explanation of Variables

To analyze the songs, we combined metrics intrinsic to music with metrics created and measured by the streaming giant Spotify. The intrinsic metrics were duration, musical key, modality (major or minor key), tempo, time signature, and genre. Spotify also computes what it calls “audio features” (defined below) for its own analyses when creating playlists, suggesting music, and so on. We used these professionally engineered metrics to supplement the intrinsic ones and deepen our insight into what might make a song award-winning.

Audio feature definitions (from Spotify):

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Danceability: Describes how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy; for example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context; rap or spoken-word tracks are clearly “vocal”. The closer the value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Liveness: Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live. A value above 0.8 indicates a strong likelihood that the track is live.
Loudness: The overall loudness of a track in decibels (dB), averaged across the entire track and useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer the value is to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words; values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, such as rap music; values below 0.33 most likely represent music and other non-speech-like tracks.
Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Although Spotify does not openly share how they determine these metrics, we found them suitable to assist in our analysis.
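One practical wrinkle in the definitions above: loudness is measured in dB (roughly -60 to 0), while the other audio features live on a 0-1 scale. For plots or distance-based methods that compare features directly, min-max rescaling puts them on a common footing. A small sketch with illustrative values (not drawn from the dataset):

```r
# Min-max rescaling maps any numeric vector onto [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Illustrative loudness values in dB (negative, as Spotify reports them)
loudness_db <- c(-18.1, -8.1, -6.3, -4.8, -1.6)
rescale01(loudness_db)   # 0 for the quietest track, 1 for the loudest
```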

As a final processing step, we split the data into training and test datasets. The training dataset contains 80% of the observations; the remaining 20% form the test dataset, against which we will test our model after training. It is very important to evaluate the model on never-before-seen data to determine not only how well the model performs, but also how well it generalizes.

# Splitting training and test set
training_size = floor(0.8 * nrow(data))
set.seed(42)
train_ind = sample(seq_len(nrow(data)), size = training_size)
training_set = data[train_ind,]
test_set = data[-train_ind,]

summary(training_set)
##     names              IsWinner           Year        followers       
##  Length:693         Min.   :0.0000   Min.   :1992   Min.   :    2597  
##  Class :character   1st Qu.:0.0000   1st Qu.:2001   1st Qu.:  868777  
##  Mode  :character   Median :0.0000   Median :2010   Median : 2350118  
##                     Mean   :0.1876   Mean   :2009   Mean   : 4338356  
##                     3rd Qu.:0.0000   3rd Qu.:2018   3rd Qu.: 5615666  
##                     Max.   :1.0000   Max.   :2023   Max.   :44692754  
##                                                                       
##   acousticness        danceability    duration_ms          energy      
##  Min.   :0.0000032   Min.   :0.130   Min.   :  78591   Min.   :0.0975  
##  1st Qu.:0.0016900   1st Qu.:0.419   1st Qu.: 206413   1st Qu.:0.6040  
##  Median :0.0278000   Median :0.522   Median : 237800   Median :0.7570  
##  Mean   :0.1553733   Mean   :0.512   Mean   : 251635   Mean   :0.7182  
##  3rd Qu.:0.2050000   3rd Qu.:0.607   3rd Qu.: 278267   3rd Qu.:0.8820  
##  Max.   :0.9880000   Max.   :0.894   Max.   :1355938   Max.   :0.9960  
##                                                                        
##  instrumentalness        key         liveness         loudness       mode   
##  Min.   :0.00e+00   9      : 95   Min.   :0.0157   Min.   :-18.148   0:203  
##  1st Qu.:4.90e-06   2      : 94   1st Qu.:0.0989   1st Qu.: -8.086   1:490  
##  Median :3.21e-04   7      : 84   Median :0.1240   Median : -6.253          
##  Mean   :6.25e-02   0      : 81   Mean   :0.2004   Mean   : -6.645          
##  3rd Qu.:1.49e-02   11     : 68   3rd Qu.:0.2320   3rd Qu.: -4.767          
##  Max.   :8.95e-01   4      : 58   Max.   :0.9980   Max.   : -1.574          
##                     (Other):213                                             
##      tempo        time_signature    valence      
##  Min.   : 48.58   1:  2          Min.   :0.0494  
##  1st Qu.: 99.19   3: 37          1st Qu.:0.3050  
##  Median :121.14   4:649          Median :0.4640  
##  Mean   :123.28   5:  5          Mean   :0.4725  
##  3rd Qu.:141.93                  3rd Qu.:0.6310  
##  Max.   :205.85                  Max.   :0.9730  
## 
# Checking if the ratio is preserved
sum(data$IsWinner == 1)/ sum(data$IsWinner == 0)
## [1] 0.2159888
sum(training_set$IsWinner == 1)/ sum(training_set$IsWinner == 0)
## [1] 0.2309059
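The random split happens to preserve the winner ratio reasonably well here (0.216 vs. 0.231). When that cannot be left to chance, a stratified split samples indices within each class separately, so the ratio is preserved by construction. A base-R sketch on toy labels (with the real data, y would be data$IsWinner); caret's createDataPartition() implements the same idea:

```r
# Stratified 80/20 split: sample within each class so the winner/non-winner
# ratio is identical in the training set (toy labels: 80 zeros, 20 ones)
set.seed(42)
y <- rep(c(0, 1), times = c(80, 20))
train_ind <- unlist(lapply(unique(y), function(cls) {
  idx <- which(y == cls)
  sample(idx, size = floor(0.8 * length(idx)))
}))
table(y[train_ind])   # 64 zeros and 16 ones: the 4:1 ratio is preserved exactly
```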
training_set
## # A tibble: 693 × 16
##    names   IsWinner  Year followers acousticness danceability duration_ms energy
##    <chr>      <int> <int>     <int>        <dbl>        <dbl>       <int>  <dbl>
##  1 Nightm…        0  2010   6262809     0.000318        0.554      374453  0.949
##  2 I'd Do…        0  1993   1034322     0.465           0.366      718600  0.561
##  3 Patien…        1  2022   4802169     0.000195        0.318      441402  0.87 
##  4 Someda…        1  2006   6137375     0.254           0.533      295560  0.59 
##  5 I Know…        0  2020   1775452     0.33            0.323      344693  0.323
##  6 Find M…        1  2021   4416749     0.256           0.873      293849  0.809
##  7 Weak -…        0  2017   2923531     0.118           0.67       201159  0.643
##  8 Walk O…        1  2001  11148674     0.00379         0.528      296240  0.832
##  9 Black …        1  2018   2341237     0.197           0.558      259893  0.902
## 10 Spectr…        0  2012   6399322     0.00225         0.578      218190  0.946
## # ℹ 683 more rows
## # ℹ 8 more variables: instrumentalness <dbl>, key <fct>, liveness <dbl>,
## #   loudness <dbl>, mode <fct>, tempo <dbl>, time_signature <fct>,
## #   valence <dbl>

Exploratory Data Analysis

Relationship Between Independent Variables

First, we looked at the continuous variables.

attach(training_set)
## The following object is masked _by_ .GlobalEnv:
## 
##     names
# Correlations between continuous variables (dropping names, IsWinner,
# and the factors key, mode, and time_signature)
cor_matrix = cor(training_set[,c(-1, -2, -10, -13, -15)])
corrplot(cor_matrix)

pairs(training_set[,c(-1, -2, -10, -13, -15)], lower.panel = panel.smooth)
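The corrplot gives a visual overview; to rank the strongest pairwise correlations numerically, the correlation matrix can be flattened. A base-R sketch on a toy data frame (with the real data, substitute the cor_matrix computed above):

```r
# Toy correlation matrix: a and c are perfectly anticorrelated by construction
cm <- cor(data.frame(a = 1:10, b = (1:10)^2, c = rev(1:10)))

# Take each unordered pair once (upper triangle) and tabulate its correlation
pairs_idx <- which(upper.tri(cm), arr.ind = TRUE)
ranked <- data.frame(var1 = rownames(cm)[pairs_idx[, 1]],
                     var2 = colnames(cm)[pairs_idx[, 2]],
                     r    = cm[pairs_idx])

ranked[order(-abs(ranked$r)), ]   # strongest correlations first
```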

